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Abstract: We propose to address the common problem of linear estima- 
tion in linear statistical models by using a model selection approach via 
penalization. Depending then on the framework in which the linear statis- 
tical model is considered namely the regression framework or the inverse 
problem framework, a data-driven model selection criterion is obtained ei- 
ther under general assumptions, or under the mild assumption of model 
idcntifiability respectively. The proposed approach was stimulated by the 
important recent non-asymptotic model selection results duo to Birge and 
Massart mainly (12), and our results in this paper, like theirs, are non- 
asymptotic and turn to be sharp. 

Our main contribution in this paper resides in the fact that these lin- 
ear estimators are not necessarily least-squares estimators but can be any 
linear estimators. The proposed approach finds therefore potential applica- 
tions in countless fields of engineering and applied science (image science, 
signal processing, applied statistics, coding, to name a few) in which one 
is interested in recovering some unknown vector quantity of interest as the 
one, for example, which achieves the best trade-off between a term of fi- 
delity to data, and a term of regularity or/and parsimony of the solution. 
The proposed approach provides then such applications with an interesting 
model selection framework that allows them to achieve such a goal. 
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Summary of the main results 

Let us fix first some of tlie notations tliat we shall use in the sequel. So, we 
consider the problem estimation of a vector quantity /3 = {P)k=i,p lying in 
some p— dimensional Euclidean subspace of (though our results extend easily 
to infinite dimensional Hilbert spaces) endowed with the traditional Euclidean 
norm || • j| defined as follows 

fe=i 

We shall also use the same notation, i.e., \\A\\ to mean the Euclidean norm of 
any q hy q real matrix A for some q G N, i.e., 

i=\ j=i 

Sometimes, we use for some q G N the notation Iq to mean the 5 by g iden- 
tity matrix, and the notation D{r]k}k=i.q to mean a q hy q diagonal matrix 
with diagonal elements 77^-, fc = 1, • • • , When wc write for some matrix A the 
following A^ , we mean the Moore-Penrose pseudo-inverse of matrix A. 

With this being said, we consider the classical linear gaussian regression 
model y = X (] + Rz, and we assume that one is given a collection of linear 
estimators of the solution /3 as follows C = {/3,„ := 'I'my,m £ Ai} parameter- 
ized by some set of models (parameters) M , that we consider finite in this paper 
though our results generalize easily to the case when M is countable-infinite. 
Such a collection of linear estimators can be obtained for instance by considering 
one (or more) of the frameworks presented in section 2. The goal is then to select 
among such a finite collection of linear estimators one estimator of /3 with the 
lowest quadratic risk (our results extend easily to other performance measures 
such as the weighted quadratic risk or the Mahalanobis risk). To do this, we 
firstly assume that when matrix X is rank-deficient, the solution of interest (3 
is linearly-identifiable in the linear model above; which means that one knows 
a-priori some phy n real matrix /C such that one can write f3 — ICXP = and 
we adopt a model selection procedure from a non-asymptotic point of view via 
penalization. Thus, we propose to select the estimator of /3 by minimizing over 
M a penalized criterion of the form 

Crit(m) = - 2/C^«'„)y + pen(m) 

where pen(m),m G A4 stands for some penalty function that depends exclu- 
sively upon a given model m through matrix but not upon data that we 
propose to address in the remainder of this paper, and we refer to the estimator 
denoted hy P ~ Pm with m = arginf„jg^ Crit(TO) as the penalized estimator 

iWhen X is of full rank, then one takes K := {X^X)-^X'^. 
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of /3. Then, we show that when the size of the collection of estimators C is not 
too great, one can set the penalty pen(m) in such a way to enforce an oracle 
inequality of the form 



E 



\\f3-P\\ 



< K min E 



11/3™ - f3\\ 



C 



for some universal additive constant C and some multiplicative constant K 
which depends on the complexity of the model collection A4 . 
More generally, especially when the size of the collection C can be great, then 
we show that another choice of pen(m) that takes into account the complexity 
of each model with respect to the collection of models A4-m a sense that we 
shall specify in the sequel- warrants to obtain a sharp inequality of the form 



E 



11/3 



< K min 



c 



with K and C standing for some reasonable universal constants, and T„i stands 
for some reasonable per-model positive quantity which is related to the com- 
plexity of a given model m with respect to the model collection A4. 



1. Introduction 

We assume the correlated linear statistical model 

y = XP + Rz (1.1) 

where y is a n-dimensional vector of observations, /3 is the p— dimensional vector 
parameter of interest, X and R are n by p and n by n design (known) matrices 
respectively, and we take z to be a p— dimensional vector of independent and 
identically distributed (i.i.d.) random variables N{0, 1). We are then interested 
in estimating the vector /? (resp. the response XP) given a single realization of 
the vector y by adopting a model selection approach via penalization. 

We propose to address such an estimation problem under rather general as- 
sumptions about the model parameters X, R, n and p and the solution /3. So, we 
regard /? as an unknown deterministic vector quantity, and we assume that ma- 
trix i? is a general n by n matrix, matrix X can be a well or an ill-conditioned 
matrix, the number of the observations (n) can be greater, equal or smaller 
than the size of the vector /3 (p), which therefore includes the case known as 
the "n < p" set up where on seeks to recover high resolution or sparse vector 
quantities f3 by using only a few measurements (18; 16). 

Our results in this paper are non-asymptotic; this means that we propose to 
work with the values of the parameters R, n and p of the linear model as they 
are, and we allow the number of models of the solution /3 to depend freely upon 
the of the solution (p) . This viewpoint as initiated in model selection by Barron, 
Birge and Massart (7) then refined by Birge and Massart (11; 12), needs to be 
contrasted with the asymptotic point of view (1; 23; 28; 29; 21) which considers 
for example that the number of the observations goes to infinity or that the 
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noise magnitude goes to zero while the number of models remains fix. As we 
shall see in the remainder, useful model selection criteria are directly connected 
to the complexity of a family of models, and by the latter, we roughly mean 
how large a model collection is; compared with the size {p) of the vector P 
^. Such a non-asymptotic property turns indeed to be very precious in many 
practical applications, mainly those which seek to recover a given vector quantity 
of interest by using a large library of models. We refer the interested reader to 
(12; 25) for a more thorough discussion on the topic. 

Non-asymptotic model selection by using penalization has become an impor- 
tant trend in statistical estimation in regression, and will certainly continue in 
fascinating many researchers either in statistics or in the engineering field and 
applied science for a long time. Early works in the field appeared indeed in the 
early nineties due to Barron and colleague (6) for discrete models, and exten- 
sions to continuous models were proposed by Barron, Birge and Massart (7). 
Aware of the pioneering works of Talagrand on concentration inequalities (31) 

Birge and Massart (11; 12) improved on Talgrand's works and refined the 
model selection approach in (7) and addressed nicely the problem of the estima- 
tion of the mean of a gaussian process in homoscedastic regression models when 
the variance is known (or estimated off-line), while taking into consideration 
the richness (complexity) of a collection of models, and which leads in some 
cases (when the model collection is not too big) to sharp oracle inequalities. 
Baraud (2; 4) and Baraud and colleagues (3) proposed many extensions to the 
aforementioned works that generalize the approach to non-gaussian homoscedas- 
tic statistical models; and under some mild assumptions about noise moments, 
amazingly they obtained near-oracle results. Baraud, Giraud and Huet (5) pro- 
posed recently penalized model selection criteria that are able to estimate the 
mean of gaussian homoscedastic models even when the variance is unknown, 
and they proved results for both the quadratic risk and the KuUbak risk. Very 
recently, Gendre (19; 20) extended their results in the case of a simultaneous 
estimation of the mean and the variance in gaussian heteroscedastic models. 
The latter work is probably the closest to our present work since it can han- 
dle heteroscedastic data as well, however, a subtle nuance exists between the 
two methods. Indeed while our method works with any linear estimators, the 
method of the author is based solely on least-squares estimators; which means 
that for each model m of the model collection (assumed to be some family of 
subspaces to which one can associate orthonormal bases), an estimator is con- 
structed as the closest p— dimensional real vector to data (in the sense of the 
squared error) by assuming that current model m is true. Though we have to 
admit that such a work constitutes a pretty important contribution to the field 

■^It is interesting to note that we do not say with (n) because, anyway, p and n are related 
since they have to satisfy generally that n >2 0( jpgjpy^ j ) where S stands for the maximum 
number of the non-zero components of the vector /3 or of its coefficients with respect to some 
fixed orthonormal basis, otherwise it is not generally possible to recover efficiently (3 even 
when the latter is some high-resolution signal(f4). 

^We refer the dear reader to (24; 25) for two beautiful lectures on the topic of concentration 
inequalities and their application in model selection. 
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of model selection in correlated regression models, the approach of the author 
might suffer for yielding the expected results for some instances of the vector 
P and of the noise matrix R especially when (3 has many nonzero coefficients 
with respect to any basis among the family of bases of the solution, and/or ma- 
trix R plays a preponderant role in the formula of the quadratic risk. However, 
when such a least-squares restriction is relaxed to allow the construction of non 
least-squares linear estimators of the solution (which is the case here), inter- 
estingly, strong priors about the sought solution /? can be inserted in a plenty 
of ways into its linear estimators (for instance, as the best trade-off between 
fidelity to data and regularity), consequently, one might limit considerably the 
influence of matrix R on the recovery process''^. Interestingly, this finds countless 
applications in the fields of engineering and applied science-witness the broad 
success of Bayesian regularization frameworks in signal and image restoration. 
We would not close this rather modest overview of related works without prob- 
ably pointing out one thriving field of non-asymptotic estimation in regression 
pioneered mainly by Candcs and collaborators (13-16) and currently explored by 
many research groups in statistics and engineering disciplines (image and signal 
processing, information and coding theory, etc.) and which is known as sparse 
statistical estimation or compressive sensing. In the latter, one tries to recover 
some vector quantity of interest assumed to be sparse or admits a parsimo- 
nious representation in a fixed orthonormal basis by considering criteria based 
on the £i-norm minimization by using linear programming concepts. The au- 
thors obtained consequently under some minimal assumptions about the model 
near-oracle inequalities of their estimation method. 

Before describing in details the main model selection results in this paper, 
we would like to open here a discussion to briefly point out our point of view 
regarding which performance measure of an estimator of /3 is better suited for 
some application; for sake of making our approach in this paper as much clear as 
possible, and for helping the dear reader set up more easily his/her model selec- 
tion framework. In fact, one has to distinguish generally between the following 
two frameworks in which the linear statistical model (1.1) can be considered 
which we briefly review: 

1. The linear regression framework: where the estimation of Xp is gen- 
erally, though not always^, the main concern for the statistician. Hence, 
the performance measure of any estimator (3 oi (3 which is used in this 
case is the predictive risk given by E[||X/3 — . 

2. The linear inverse problem framework: where the estimation of f3 is 

*We would like then to point out that this paper is devoted to linear estimation by model 
selection in a broad context, and another paper (8) which deals specifically with the problem 
of construction of non least-squares estimators that can take into account to some extent data 
heteroscedasticity in the goal of achieving more interesting balances between the bias and the 
variance terms in the formula of the quadratic risk is currently in the writing process by the 
author in the spirit of the present work. 

^because one would also use the regression framework to select an estimator of /3, but it 
is to use with some care to avoid any bad situation of selecting an estimator such that 
e[||X/3-X^||2] siO, but E [11/3- > 0. 
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the main concern for the statistician. Hence, the performance measure of 
any estimator /? of /? which is used in this case is the quadratic risk given 
byE[||/3-/?f]. 

However, one shows easily that w.Lg., even to consider a collection of linear 
estimators of X/3 as follows 

{XP,^ := Xl3rn ■■= X'^^„^y, meM} 

and use as a figure of merit of any estimator Xf3 := X(3 of XP its quadratic risk, 
which is simply the predictive risk of the estimator (3 of /3, one then finds oneself 
in the presence of the model selection problem we stated earlier in this section. 
Consequently, we shall address in the remainder the two problems, namely the 
linear regression problem and the linear inverse problem, in a same framework. 
However, in contrast to the regression set up, model identifiability might be an 
issue to consider in the inverse problem set up to derive useful model selection 
procedures, so we shall also provide some useful ideas that help overcome such 
an issue. 

Regarding now the appropriateness of either performance measure of any esti- 
mator of /3, it turns out indeed that in many engineering domains, the vector 
quantity of interest that has direct application is (3 and not XP (14). In this 
case, using the quadratic risk as a figure of merit of an estimator of P makes 
more sense than using the predictive risk. In image reconstruction/restoration 
applications for instance, P represents an image (e.g. an MRI scan of the heart 
or the brain of a patient) , X then models the design matrix of the imaging de- 
vice (e.g. the MRI scanner), and Rz models the stochastic errors of the imaging 
device. Obviously, the vector quantity of interest in this case is the image P that 
one would need to reconstruct for further use (for cardiac or brain diagnosis for 
example). However, when one is more interested in the estimation of the system 
response XP than P itself, then using the predictive risk as a figure of merit of 
an estimator sounds more interesting in this case. As some illustrative examples, 
one can mention the variable selection problem that we shall discuss in section 
2, and the reconstruction problem (for filtering or recognition purposes, etc.) 
by using a finite library of vector primitives. In the latter, one has generally an 
a-priori linear model of some vector quantity of interest u (a signal, an image, 
a curve, a shape, etc.) as follows u = XP, where {Xk, k = 1, • • • ,p} stand for 
some set of vector primitives (predictors) which, depending on the application, 
can be for example eigen objects (e.g. eigen images, eigen shapes, wavelets) or 
simply some database of generic objects, and P stands for the vector of the co- 
efficients of the linear combination of such vector primitives. One's goal is then 
to estimate the vector of the coefficients P in such a way to achieve the lowest 
predictive error. To cut a long story short, depending on the application, one 
may find good reasons to prefer the use of one performance measure from the 
use of its alternative (see for example subsection 3.2 for another reason that 
might justify the use of the predictive risk). 

Having said this, the rest of the paper is organized as follows. In section 2, we 
set up our model selection framework and we describe some applications that 
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can be expressed in terms of our framework. In section 3, we derive the main 
data-driven model selection results in this paper, and we provide some useful 
clues that help choose the penalty function and its parameters. In section 4, 
we discuss some practical solutions that can lead to overcome the identifiability 
issue of model (1.1) when the rank of matrix X is smaller than p. Section 5 
is devoted to the numerical experiments that show the performances of the 
approach for some application examples. Finally, a general discussion about the 
proposed approach and its future extensions concludes this papers. 

2. Collection of linear estimators of (3 

We assume that one is given typically a finite collection of linear estimators of the 
solution P which is parameterized by some finite set of models (or parameters) 
M as follows 

{/3m:^'f my,me M} (2.1) 

with standing for some p by n matrix for all m S Ai, and the goal is to 
select among such a family of linear estimators | ,m £ one estimator of 
P with the lowest quadratic risk. 

Concerning the way in which these linear estimators are constructed, this 
depends generally on the application and on the prior that one has about the 
solution /3, and some examples of their construction encountered in countless en- 
gineering domains (signal/image processing, computer vision, applied statistics, 
to name a few) are highlighted below. 

2.1. Reconstruction/ Restoration by regularization 

In many image and signal reconstruction/restoration applications, for recovering 
some vector quantity of interest namely (3 from an observation of model (1-1), 
one proceeds by solving an unconstrained quadratic problem of the form 

(y - X(3) ^ P(y - XP) + (3^Hj3 min (2.2) 

where P and H stand for two given n by n and p hy p symmetric matrices 
respectively, both considered generally to be positive semi-definite. The two 

competing terms (^y—X(3j p(^y—Xf3j and (3'^ Hf3 in (2.2) in (2.2) are generally 

referred as the data fidelity term, and the regularity term respectively. As an 
illustrative example, when [3 is assumed to be some smooth vector quantity, then 
H is taken generally to be some regularization operator (e.g. a high-pass filter 
such as a differential operator or any linear combination of differential operators 
of different orders, a band-pass filter, etc.) in order to enforce smoothness of the 
recovered solution, and P is generally taken as the inverse of the covariance 
matrix of the observations, i.e., P = i^W li) or simply as the n by n identity 
matrix, i.e., P = !„ ■ This is also known in applied science as the Tikhonov 
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regularization problem (33) and which has Baycsian interpretation (maximum 
of posterior log- likelihood of data in a gaussian set up). Solving now for [3 in 
(2.2) gives 

P = {X^PX + H)^X'^Py 

However, it is quite common that matrix H and probably matrix P too (useful 
if one wants for example to test the performance of the linear estimators against 
various Mahalanobis distances between data and an estimator's response) de- 
pend on some parameters (referred in engineering as regularization parameters) 
that are difficult to tune optimally for a particular application. Therefore, one 
assumes a finite collection of parametric couples of instances of the matrices P 
and H as follows {[Pm, Hm) e M}, and by putting for all m & M. 

*,n := (^X^PmX + H„?j^X^Pm 

one obtains finally a finite parametric collection of linear estimators of /? of the 
form I Pm '■= ^mV, m G I . The goal is then to select among such a parametric 
collection of linear estimators of P one estimator with the lowest risk. One then 
finds oneself in the presence of the model selection framework that we set up in 
the beginning of this section. 



2.2. Optimal representation in a family of orthonormal bases 



Some applications that attempt to estimate P from an observation of model 
(1.1) assume that P has a sparse representation in a given family of orthonormal 
bases in of different complexities (i.e., dimensions) A ~ {Am,m G A4} (e.g. 
trigonometric bases, a wavelet family, etc.), and let us denote by the matrix 
which rows correspond to the respective vectors of the complement basis in 
of the orthonormal basis A,„, for all m G A4. So, these applications proceed 
indeed by solving for all m Cz Ai a constrained quadratic program of the form 



{y-XPmyPm{y-XPm) 



mm 



(2.3) 



where stands for a n— dimensional symmetric matrix. Let us put 

Cm = (X^P„,X + $f„¥™)^ 

One checks that the optimal estimator with respect to m G is given by 



Pn 



Cn 



,{^mCm^Z) 



x'Pmy 



-^T — 



moreover, if matrix (^X^PmX + $^<I>„i) is of full rank (i.e., p), then such an 
estimator is the unique solution of (2.3). Let us now put for all to G 



In 



^m.\^mCm^m) 



X^P„ 
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hence one obtains a finite collection of linear estimators of /3 which is given by 
|/3m '■= 'J'm!/j™ G 7\4|, among which one wants to select one estimator with 
the lowest risk. One retrieves again our model selection framework. 

Remark 1. The formula of the solution (3m of (2.3) and all those that shall 
come up later in this section remain valid for arbitrary matrices Pm and $m, 
and it is not definitely necessary to assume that the row vectors of are 
orthonormal (hence when <^„i is orthogonal, one may replace $,„ with Ip — 
$^<f>m^- Please note that one can approximate the formula of (3„i by a simpler 
formula for some large enough positive number /x as follows $m ~ {X^PmX + 
^I'^l^mYx^Pmy. 

2.3. Optimal representation in a family of orthonormal bases with 
regularization 

It happens that the solution (3 that one looks to estimate from an observation of 
model (1.1) has a sparse representation in some family A of orthonormal bases of 
some subspaces in R^* which we write as A = {Am', m' e and has some 

regular (e.g. smooth) structure. So, let us denote for all m' S by 
the matrix which rows correspond to the respective vectors of the orthonormal 
basis Am', and by the matrix which rows correspond to the respective 
vectors of the complement basis Am' in R'' of A,„' and put A = {Am',m' G 

recover the solution of p in the orthonormal family 
A as parsimoniously as possible, while imposing either on f3 or on the vector of 
coefficients of /? in any orthonormal basis Am' € A to have a regular profile (e.g. 
smoothness), but the latter is generally difficult to guess beforehand. Hence, one 
would like to consider on top of A a finite parametric collection Ti. of regularizing 
operators with increasing regularization powers as follows 7i = | Hg ,6 E } , and 
depending on whether the smoothing is applied on f3 directly or on its vector of 
coefficients with respect to any orthonormal basis Am' G A, an operator Hg E Ti 
may depend or not on a basis A^'- Nonetheless, for the sake of simplicity and 
without loss of generality (w.l.g.) (see remark 2 below), one can consider that 
for all e O, Hg is p by p symmetric matrix and that one is given a family 
of models (quadruplets) of the form | (A,„, A,„, i7„i, P„j), m E M} where for 
all m E Ai, Am E A, Am E A, Hm E H, and Pm is some n by n symmetric 
matrix which enforces closeness of an estimator to data. Then with respect to 
all m G A^, one solves a constrained quadratic program of the form 

(y - XP] ^Pm (y - Xp) + p'^HmP ^ min 
Let us put for all m e 

Cm = {X^P^X + + Bmf 
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One checks (see the proof in the appendix section) that the optimal estimator 
with respect to m G is given by 



■^T /-r 



and one can approximate the latter for a large enough positive number fi as 
follows 



(3m ~ {X PmX 

Let us now put for all m G 



v& ■= C 



1 2/ 



One obtains in the end the following finite collection of linear estimators of /3: 
|/3m := ^mV^rii G among which one would want to select one estimator 
with the lowest risk. One retrieves again our model selection framework. 

Remark 2. When the smoothness constraint is imposed rather on the coeffi- 
cients of an estimator (3m in a given orthonormal basis Am G A, one can proceed 
in the same way as above by defining for all m €z Ai the new filter Hm as follows 
Hm ^m^m^m with Fm standing for the filter which operates on the vector 
of the coefficients of Pm with respect to Am, and Hm standing for the new filter 
that operates directly on the reconstructed solution Pm- 



2.4- Optimal linear filtering 

Linear filtering (low-pass, high-pass, band-pass, band-stop, etc.) is an impor- 
tant pre-processing task in various signal and image processing applications. 
For instance, low pass filtering (e.g. gaussian filtering) has become a standard 
step of the image processing chain which aims at removing white noise from 
a noisy image y in order to improve its quality (measured generally as peak 
signal-to-noise ratio (PSNR)) and simplify its further analysis and interpreta- 
tion. However, linear filters depend generally on some parameters (e.g. a filter's 
bandwidth) which are difficult to tune optimally for a particular signal or an 
image. Hence one would need to consider a finite collection of parametric linear 
filters as follows {'^m, m G A4} from which one would like to select the one with 
the best parameter to apply on the noisy image/signal y. Such a problem can be 
formulated indeed in terms of our model selection framework by considering a 
finite collection of linear estimators of the noiseless signal or image P as follows 
{Pm ■= '^mU, m G M}, therefore the goal amoimts to selecting one estimator of 
P with the lowest quadratic risk. 

2.5. Variable selection in regression 

In this problem, one commonly assumes a collection of possible configurations 
C = |i^,ri, m G of /3, where each configuration v stands for a binary vector 
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of the same size as /3, and a component = 1 means that the predictor Xk (i.e., 
the k — th column of matrix X) is not part of the regression model (i.e., I'fc = 1, 
Pk = 0), otherwise it is said to be part of the regression model (i.e., i/^ = 0, 
f3k 7^ 0), and the goal is to figure out the configuration of f3 that achieves lowest 
predictive error (see below for more insight on the use of the predictive risk), 
in other words, the most influential variables (i.e., those with most explanatory 
power) in model (1.1) among the set of variables {Xk,k — 1, • • • So, let 
us denote for all to e by Nm the p— dimensional diagonal matrix such that 
Nmik, k) = i^fc, fc = 1, • • • ,p. Hence one proceeds by minimizing with respect to 
each configuration Vm G C a constrained quadratic program of the form 

y - Xfirr}] Prn ill ~ X^rr^ ^ miu 

^ ^ ^ /3/Af„,/3=0 

for some n— dimensional symmetric matrix P„i to achieve finally a collection of 
linear estimators of (3 as follows: |/3,„ ^my,Tn £ where for all m G A^, 
one has 5'„i which is given by 



with 



t 



C'ra = iX^PmX + 



We would like to emphasize that our goal in the remainder is not to address the 
problem of construction of the matrices P^, N,n for a given application 

since they are application-dependent and they are constructed from the a-priori 
knowledge about the solution as we already mentioned it above ^. Nevertheless, 
our goal~once a finite collection of them has been chosen-is to provide the user 
with an efficient model selection tool that allows him/her to choose the almost 
best ones to recover the solution. 

Before starting in addressing such a model selection issue, we would like to 
enunciate upfront the following proposition which gives the exact formula of the 
ideal linear estimator of /3 with respect to the quadratic risk (see formula (3.2) 
in section 3). 

Proposition 1. The optimal linear estimator [3* that minimizes in TV the 
formula of the quadratic risk is given by 

(3* ^(3{Xpf({XP){Xpf + RR^)\ 

®in fact, the experts in these domains have already done a great job in this respect, and a 
overview of the subject in this paper would only be a weak copy of their previous findings . . . 
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and its quadratic risk is given by 



E 



11/3* -/?irj = m 

iXpf(^{Xl3){Xpf + RR^y {Xl3){XI3f(^{Xl3){XI3f + RR^^ {Xp) 
2{X(3f[{XI3){XI3f + + ({XI3){XI3f + RR^^ {XI3)R 



1 

+1 



Proof. The proof of this proposition consists of a simple minimization of formula 
(3.3) of section 4 which thus amounts to solving an unconstrained quadratic 
program. □ 

Remark 3. As an aside, it is interesting to note that proposition 1 says among 
other things that the solution with the lowest risk of model (1-1) corresponds 
to the least-norm solution of y' = X/3, which is nothing else than X^y' = 
X^Xp. However, the latter does not correspond generally to the solution that 
one is looking to estimate, hence the estimation of the actual solution would 
have unavoidably a bigger risk. 

Though such a result of proposition (1) is of little use in practice since it says 
that such an ideal filter depends on the unknown (3 itself, nevertheless, it might 
provide some useful hints for the construction of potential linear models of the 
solution for some applications. Moreover, when the object /? that one would like 
to estimate represents some object among a finite library of objects, then span- 
ning such a library once allows to construct potential linear estimators of the 
solution among which the ideal one exists. For instance, in recognition or track- 
ing tasks in computer vision, the unknown (3 might model a given physical object 
of interest that was previously extracted from some image for example by using 
any segmentation method, and one wants actually to recognize it among a finite 
library of objects O = {0„i, in e M}. Then to each object 0,„, m £ M, one as- 
sociates a dimensional vector of features (3m (for instance, some discretization 
of the contour of Om), matrix X stands for some linear geometric transforma- 
tion (rigid, affine, projective, etc.) applied onto the object, Xf3 represents the 
object f3 after deformation and Rz represents noise (and probably the errors 
due to the linear approximation of the deformation). Hence, one might consider 
the finite collection of linear estimators of (3 as follows: {(3m = '^my,fn E Ai} 

with (3miXPmV(^{Xf3m){X(3mV + -R^^) ^r aU m e M, and use the 

model selection procedure that we present in the sequel to select the best one 
i.e., to recognize the actual object. 

We should point out that the same approach might be used also in estimation; 
actually when one has a-priori a finite collection of profiles of the vector of the 
coefficients of the solution with respect to some orthonormal basis (or a fam- 
ily of orthonormal bases). These profiles may be obtained, for example, if one 
knows that the solution belongs to some class of p— dimensional vectors (as- 
sumed to be some discretization of continuous functions) such that the vector 
of the coefficients of each vector u of this class with respect to some p hy p 
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orthogonal matrix can be bounded sharply by some parametric function (a 
p— dimensional vector actually) fm'ifJ-iu)) which parameter fj.(u) should be of 
much lower dimension than p of course (typically 2 or 3). In this case, one may 
proceed by instantiating a finite collection of instances of the parameter ^(u) as 
follows {/im, TO G A^}, and constructs a finite family of linear estimators of /3 as 

follows := *™,TO e M}, with := P„,{X P^)^ (^{X P^KX f3^f + RR^y 

and f3,n = '^J^JifJ-m) for all to e TW. 

3. Data-driven model selection for linear regression and linear 
inverse problems 

Keeping in mind the linear regression model (1.1) and the collection of linear 
estimators of f3 introduced in (2.1), we suggest to devise a data-driven model 
selection criterion that can select a good estimator of /3 among such a collection 
of its linear estimators, and we measure the performance of any estimator /3 of 
f3 with its quadratic risk given by E[||/3 — /3|p] . 

Clearly, when matrix X is rank deficient, i.e., its rank is inferior than p, 
model (1.1) is not identifiable. In this case, one needs to put some assumptions 
on the solution (3 to guarantee identifiability of the model and to make of its 
statistical inversion a possible question. Therefore, our identifiability hypothesis 
in this body of work consists in saying that one knows a-priori some linear 
operator /C-called a noiseless reconstructor of P~ such that one can uniquely 
recover the unknown vector quantity of interest /3 from one noiseless observation 
of model (1.1) (i.e., if matrix R in (1.1) was zero) by using a linear formula of 
the form 

(3 = K.X(3 (3.1) 

This shall be referred as the linear identifiability condition. The motivations for 
imposing an identifiability condition of type (3.1) on model (1.1) are presented 
in details in section 4. Please note that without any prior on /3, one can still 
write (3 = 1CX(3 + /C/3, with JC = X^ and K, = Ip — X^X. In this case, only 

the component of (3 that belongs to the subspace S = ^X^^Xt^t e R^|, in 

other words n = X^X(3 (which is also the solution of min^g^) could 

be theoretically recovered, and the left part of (3 i.e., jl = {I.p — X^X)(3 which 
belongs to the subspace R*^ — S hence is lost. Though it is generally of little 
use in practice, one would recover such a least-norm estimate of the solution of 
(4.1) by writing the new model y = Xfi + Rz, and one has fi which obeys the 
linear identifiability condition with K, = since one can write /x = X^ Xfi (see 
also section 4 for more details). In section 4, we shall describe some situations 
which can lead to derive useful expressions of the noiseless reconstructor JC of 
P to allow recovery of solutions other than such a least-norm solution. 

With this being said, let us now focus on the model selection problem we 
stated previously in section 2, and let us start by deriving for all to e the 
expression of the quadratic risk of an estimator of /?, for all m Cz A4. So, one 
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fixes some m ^ Ai, and from (1.1) and (2.1) one has 
hence 

/3™-/3= (*,n^-/)/3 + *™i?2 
then, by rising both sides of the latter equality to power of two, one obtains 



+ 2/3^(*„,X-/)^*,„i?z 



Finally, by applying the expectation operator on both sides of the latter equality, 
one derives the following expression of the quadratic risk of 

- = p'^i^mX - Ipf{^mX - Ip) f3 + tr (i?^*^*„,i?) 

or equivalently 



E 



Formula (3.2) says that the quadratic risk of /3m decomposes as the sum of two 
terms; a bias term: 

||/3f + p'^X^^'l-^mXP ~ 2p'^^mXP 

and a variance term: 

and the best estimator of P denoted by Pm* for some m* G is definitely the 
one that achieves the best bias- variance tradeoff in the formula of the quadratic 
risk (3.2); or equivalently (since ||/3||^ is a constant), the one that minimizes over 
M the expression 

P'^X^^l^^„,XP-2p^^^Xp+\\^,nR\\^ (3.3) 

Unfortunately, since expression (3.3) that one ideally would like to minimize 
over M depends on the unknown P itself, hence such a minimization cannot 
be carried out in practice. Thus, we refer to the optimal but the unacccssible 
procedure that minimizes (3.3) over Ai as an oracle, and the latter shall serve 
as a benchmark for assessing an estimator's performance. 

To the impossibility of minimizing exactly the expression of the quadratic risk 
adds, unfortunately, another more challenging difficulty which has to do directly 
with identifiability of model (1.1)^. Indeed, when the rank of matrix X is inferior 
than the size of the solution P (p), only the linear transformation XP of p can 



'^Obviously, as explained earlier in this paper, when Xp is the quantity which one looks to 
estimate (i.e., in the regression case), model identifiability is not generally a concern. 
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be observed (up to noise), and because of the term 0^'^raXj3 in (3.3) which 
doesn't depends upon (3 only through Xfi, this makes it quite challenging to 
derive a satisfactory data-driven model selection procedure. Nonetheless, with 
the identifiability condition that we introduced earlier in this section, such a 
model selection procedure becomes possible. Indeed, assume that one can write 
(3 = ICXP, one deduces the following new formula of the quadratic risk 

hence formula (3.3) that one would like to minimize over M becomes 

{Xpfhl^-^,n - 2IC^^„^)xP+\\^,nRf 



The latter formula has the advantage to depend on /3 only through Xf3 which, as 
we shall see in the remainder, makes it now possible to derive useful data-driven 
model selection procedures. 



3.1. Model selection 

Now, we propose to select an estimator of /? among its finite set of linear esti- 
mators (2.1) by minimizing a penalized criterion of the form 

Crit(m) = (*^*™ - 2IC'^^„,)y + pen(m) (3.4) 

where pen(m) stands for an arbitrary penalty function which good choice shall 
be addressed later in this section. The estimator of /3 denoted hy $ = Pm that 
minimizes criterion (3.4) shall be referred as the penalized estimator of P, and its 

quadratic risk E — /3||^ shall be referred as the penalized risk. The following 

theorem provides then some useful clues that help choose the penalty pen(77i) 
in (3.4). 

Theorem 1. Assume model (1-1) along with the collection of linear estimators 
of j3 introduced in (2.1), and assume equality (3.1). Let us fix some positive 
number (0, 1), and consider for all m G M the matrices 

An ^R^{- + -^C^^m + R 

and denote by the largest positive eigen value of the symmetric matrix Am 
and by r*^ the largest singular value of matrix Bm • Let us now associate to each 
model m € M two real positive numbers L,„ and h„i with the Lm 's satisfying 
E = X^msTM '^-'^pI^^™] ^ '^^'^ consider for all m E Ai the quantity Qm 
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\ A W r* \2 

\^rn.\\ __ . '_m . ^-^ ^ 



m G 



for some positive number X. It follows that for every penalty function pen(m), 
M., the corresponding penalized estimator (3 satisfies 

{l-e)E\\\p~(3f] < inf |||/3||2+/3^X^^'^^'™X/3-2/3^*„X/3+|l*™i?|l2 

L J niEM I 

- 2tr(i?^/C^^'„i?) +pen(m)| + sup [Qm - pen{m)) 
The proof of this theorem is deferred to the appendix section. 



2S 

T 



3.2. Equivalent of the model selection result in the regression 
framework 

Though we considered the inverse problem framework and we argued that this 
was w.l.g., however before commenting in details theorem 1, we would like to give 
its equivalent in the regression framework as this might help make our approach 
as much clear as possible. Another motivation that we would like to emphasize 
and which we had already the opportunity to discuss in the beginning of this 
section is that, in contrast to the inverse problem framework, the identifiability 
assumption about model (1.1) is not necessary in the regression framework to 
use the proposed model selection approach. Indeed, such a linear identifiability 
condition might turn to be difficult to establish in some applications, however, 
one might use instead some priors about the solution (see the examples in section 
2) so as to enhance identification of the solution, and feel rather safe to use the 
predictive risk as an estimator's measure of performance. 

Now, consider the finite collection of linear estimators of (3 given in (2.1), 
then one establishes the following formula of the predictive risk of /3„i, for all 
m & M 



E X||/3,„-X/3„ 



X/3||^+(X/3)'^'(<,X^X*,„-2X*,„)(X/3) + ||X^',„i? 



and one selects the model of [3 by minimizing over Ai a penalized criterion of 
the form 

Critp,.ed(m) = z/(*^X^X*,„ - 2X-^„,)y + ^en.p^^^{m) (3.5) 

for an arbitrary penalty pen?""'^''(m), to G M.. Let us denote by /3?""<^'' = the 
estimator of (3 that minimizes (3.5) over A1, then the following theorem provides 
an upper bound of the predictive risk of pP^'^'^. 
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Theorem 2. Assume the linear model (1-1), along with the finite collection 
linear estimators of /3 introduced in (2.1). Let us fix some positive number € 
(0, 1), and consider for all m ^ M the matrices 

and denote by s+ the largest positive eigen value of the symmetric matrix A^,^"^ 
and by the largest singular value of matrix B^'^'^. Let us now associate to 
each model m G Ai two real positive numbers Lf^^'^ and h^^'^ with the L^^'^ 's 
satisfying YP'^'^'^ — X^mGM ^■^P[~-^m'^''] ^ '^"■'^ consider for m £ Ai the 
quantity 



-^7n f't-n 4- \ ^ 



y^lZ^+^LPr' + hZ'' ^ 



for some positive number A. It then follows that for every penalty function 
penp^^j^{m) for all m £ M, the selected estimator pp^^'^ satisfies 

{l-e)E\\\XpP''^'^-Xp\\''] < inf \\\Xp\\^+{Xpf(^l^X^X^^-2X^,„){Xp) 
+ \\X^mR\\^-2tr(^R^X^rnR)+penp,,aim)'j+ sup {Qf;^''-penp,,rf(m)} 



Proof. Theorem 2 is a simple consequence of theorem 1 by taking into account 
the remarks above about the fact that considering the quadratic risk as a perfor- 
mance measure of an estimator X(3 oi Xj3 amounts to considering the predictive 
risk as a performance measure of an estimator (3 oi (3. □ 

As such a model selection result in the case of the predictive risk is a special 
case of the more general result in the inverse problem framework, therefore all 
our comments below, except the identifiability issue, apply to such a regression 
framework as well. 



3.3. Commenting the choice of the penalty 

Theorem 1 suggests to take the penalty function, at least for models m for which 
the quantity Qm is very large, such that pen(m) > Qm and choose the L^'s in 
such a way to get the value of the quantity ^ very small in order to achieve a 
reasonable value of the upper bound of the penalized quadratic risk. Hence, in 
order to see the performance of the proposed model selection approach, one has 
to choose accordingly the penalty function and its constants, and let us assume 
in the remainder that the solution j3 obeys the linear identifiability condition. 
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Thus, we propose to choose the penalty function for all m G M such that 
pen(?T!,) = Qrm in other words, one puts for all to G 



pen(TO) := 2tr(^R'X' ^,nRj - e\\^raRf + 2(- 

+ 2\\A,n\\y^L„, + hrn+ " 



^m)-^ra 



h 



111 



(3.6) 



Theorem 1 then guarantees that the penalized estimator /3 obtained with such 
a choice of the penalty satisfies 



(1-0)E 11/3 



< inf 

meM 



2 + p'^X^-^l;^,^XP - 2P^^rnXp 



(1 - e)\\-^,nRf + 2(-^ + S+)L„, + 2\\A,n\\ VL,n + h. 



+ A 



2S 

T 



However, it is still difficult to sec in the latter formula how the performance 
of the proposed penalized estimator compares with the lowest value of the pre- 
dictive risk over M which is attained only by an oracle of course. Hence, the 
following proposition provides an interesting strategy of choice of the constants 
in the penalty function (3.6) which enforces an oracle inequality. 

Proposition 2. Assume a penalty of the form (3.6). It follows that if one 
defines for all m € M the Lm's, for some positive sequence irmm G A4 (these 
shall be referred as the model weights) and for some a G (0, 1], as follows 



C-^ + s+)||vl/,„i?||4 + 4a^A^\\^{\\M>^R\\^ ■ 
and defines, for some positive number e the km 's as follows 



eH'-^ + s:.y 



■1„ 



L>e(; 



with 



I'frnRW 



id takes 



fili 

V e 

A := 



s+)||vE',„i?||4 + 4a2||A,„P|lvI/„,i?|j2 
V2S^ 



{( 



} 



then the penalized estimator [3 satisfies 



E 



< 



a 



T n,i2l 2^/2(1 + e)Eir ( fr*^ 
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The proof of proposition 2 shall be given in the form of a commentary of 
theorem 1 following our choice of the penalty (3.6). 

So, let us assume a penalty of the form (3.6), and choose the L,n's for some 
positive sequence of model weights ^m, m S and for some positive number a 
(one can assume a € (0, 1] w.l.g.), as follows 



+ S+)ll*mi?r + 4a2j|A^||2(j|vI.„i?||2 + 

and by noticing that 



and 



one obtains the new upper bound of the penalized risk 



{1 ~ 9)e\\\P ~ (3f] < mi |||/3||2+/3^X^*^^^'„X/3-2/3^*„X/3- 



(> 



Now, if one puts 



m I \ \ ^ m 



A := 



*mi?lr + A 



+ V Lm + hr, 



r* N 2 



sup„eAi { (l7f= + ^ + «™) } 



one finds that /? satisfies 



E 

1 - 



|/3-/3fl <-\ inf |||/3||2 + /?^X^vi/^vi/„x/3-2/3^vi/„X/3 

J 1 — t/ meM L 



{l-6+lra+-^JJ^)\\^mRf] + p^ \ SUp |f 



ni^Ai ^ ^ 2y iv^j 



2E 

T 



)'}■■ 



(3.7) 



which recalls the oracle equality (3.3) up to the per- model multiplicative quan- 
tity (j- — + £jn -\- "^V^mj a-nd the additive constant which is given by 



9, a,/:) := - 



2V2 



sup 



{( 



eM ^ ^2\/Lr, 



and which can be made universal at the price of augmenting the i^m's. It is 
interesting to note that one can see the quantity {\ — 9 + ljn + '^V^^ W^mRW^ 
as the difficulty of representing the solution [3 by using model to, expressed 
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in different words, one would say that such a quantity measures somehow the 
complexity of a given model m with respect to the model collection Ai. 

Regarding now the choice of the model weights im, one has to distinguish 
generally between the two small collection of models and a large col- 

lection of models. To start, if one assumes a small collection of models (say, a 
few dozens), then one may take them to be fix, i.e., im '■= (■ for all m S while 
guaranteeing that for a reasonable choice of the constant £, the value of S can 
be bounded by a small enough constant C. In this case, one obtains over the 
set ^ of all collections of estimators such that ^ = {£', s.t. r(0, a, C) < C} 
a near-oracle performance of the form 



E 



< K inf E 



\Pr,^-|3\\ 



c 



1-e 



with K = K{B,a,C) = 

Now, assume a large family of models, then taking the ^m's as a fixed value 
for all TO € A1 may result in an inflation of the upper bound of the penalized 
risk. To account for this situation, let us first define the following quantity for 
all m e M 



+ S+n)\\^„^Rr + 4a2p„i|2(||vl/,„i?||2 + /j„J 



which can be seen, in some sense, as a measure of a model's complexity, hence one 
can write Lm — ^m^m- To start in exhibiting the intuition for the subsequent 
choice of the £m's, let us fix ideas by assuming, for the moment, that can 
take its values in an ordered set of discrete values, and for simplicity, one can 
assume that such values are the integers in the interval [Amm, Amaa;] . Let us 
also denote by card^(A) = X^mex Ia„=a, and put Lm := ^a^, for all to € 
which makes sense since one would like to roughly privilege among models with 
the same complexity the one(s) which make(s) a difference (i.e., achieves the 
lowest value) in the bias term of the penalized risk. Hence one finds that 



S = ^exp[-4„Am]= ^ card^(A)cxp [- ^aA] 

m£M A=Amin 
A„ax 



= ^ exp log [card>i(A)] — ^aA 



A=A„ 



then, if one takes £a 



log [cardAl(A)] 
A 



interestingly, one finds that S < 



which means that with such a choice of the model weights, one 



exp [-A„,i„i5] 
1 — cxp[— (5] 

can make arbitrarily small the value of S. 

Therefore, one would like to generalize this idea to the actual case in which the 
A„i's do not take discrete values but continuous values instead in the hope of 
bounding sharply the value of S while keeping the expression of the penalized 
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risk reasonable. Then one way to achieve such a goal would be by mimicking 
the discrete quantity cardan (A^) in R"*" by the means of a kernel diffusion for 
instance^. By choosing the latter to be a gaussian kernel with a bandwidth a, 
then one possible expression that mimics interestingly cardjvi(Am) in is 



H exp 



m'GM 



2a2 



One checks that in the discrete case, the latter quantity converges to card» {Am] 
as fj^ ^ 0. One may then take the im's for some positive number S as follows 



log 



Em'GAi exp [ - 



in which case, one finds that 



exp 



(5A, 



nieM 



23^ 



Though it is not generally possible to give a sharp closed-form expression of 
E, one has S which is overwhelmingly bounded by X^mGA^ ''■^P [ ^ ^^m] for ^ 
reasonable choice of the constant (i.e., not taking it too small), hence if one 
fixes the value of cr^, a proper setting of the constant S allows to bring down 
enough the value of T{6, a, £), i.e., in such a way to achieve T{6, a,C) < C for 
some small enough constant C. One should choose the constant C to be small 
enough so as not to inflate the additive constant T(6, a, C) in the formula of the 
upper bound of the penalized risk, and big enough in order to keep the model 
weights £m,'m- S -M reasonable. In practical applications, one should take it as 
follows C:=o(||/3||2). 

Concerning the choice of cr, there is not an optimal value of it actually since as a 
increases, the f^'s decrease (desired) resulting in the increase of S (undesired), 
and vice-versa. Nevertheless, there exists some intuitive facts that suggest to 
take (T as follows 



with ^ which can be taken for example as follows /x := 3, and r which is some 
positive quantity which represents in some sense the scale of the quantities 
Am, m G A4, which, therefore, may be taken as follows 



A„ 



In fact, the intuition for such a choice of a is that, if one had to redefine discrete 
values for A^, one would achieve this as follows Am [-^^l ■ r, Vm G M with 



*This is remindful of Kernel Density Estimation (KDE), the difference here is that the 
A™'s are deterministic values . . . 
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[a;] which stands for the integer value of x. Hence by mapping the \A4\ values of 
Am into the discrete interval [1, ■ • ■ ,J3], one realizes in some way a discretization 
of the Am's with a discretization step which is equal to -, hence one retrieves 
somehow the discrete case that we discussed above. It fellows that imitating 
the quantity card^(A) amounts to taking 3(t « ^ (since almost 90% of the 
energy of a gaussian distribution with standard deviation a is concentrated in 
an interval of size ficr around its mean, with /x « 3) . 



Regarding the setting of the /im's, following our choice above of the penalty 
and some of its constants namely the Lm's and A, their role principally is to 
prevent the inflation of the additive constant T{9, a, C) thereby the inflation of 
the upper bound of the penalized risk when there are among the collection M. 
some models with too small weights (which can be seen as nuisance models in 
some sense). So, to address the choice ft-m's, first let us denote by 



-4a2||4 



hence, one may define the hm^s as follows 

II An IP 



l*„,i?ll 



and one has 



one finds that 




2x/2(l+e)I]i [• 



1 - 



sup 

meM 



{(^-<.)} 



One may choose the constant e for instance as follows e := 1; which turns to be 
a good trade-off between keeping the constant T{6, a, C) reasonable on the one 
hand, and not inflating too much the weights on the other hand. 

For the choice of two remaining penalty parameters namely 9 and a, as one 
could notice, their optimal choice strictly speaking doesn't exist as this suggests 
knowledge upfront of the best model. Nevertheless, it is easy to choose them very 
reasonably without prior knowledge of the best model by examining the formula 
of the upper bound of the penalized quadratic risk (3.7). Indeed, firstly for 0, 
formula (3.7) suggests to use values which lie between but far enough from the 
critical values and 1, hence, unless there is a possibility of optimizing the value 
of 9 for a specific apphcation (e.g. by simulation), we suggest to choose it for 
example as follows 6* | which happens to be a good compromise between the 
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bias and the variance term in the formula of the upper bound of the pcnahzcd 
risk. Secondly for a, one can see that increasing a results in the decrease of the 
quantity (l — 6 + im + ^V?m) and the increase of S at the same time. And 
by remarking that the dominant quantity in + -^^/t^ is (m, hence any value 
greater enough than would be adequate for a, hence by default, one may take 
it as follows a := 1 which happens to be a reasonable choice. 

The strategy of choice of the penalty and its constants that we presented in 
this section constitutes of course one possible strategy that turns to enforce an 
oracle inequality of the selected estimator independently of any application, and 
we do not exclude that more optimized penalties for some specific applications- 
by the means for instance of a simulation study that attempts to learn the 
optimal values of the involved constants in the penalty for these applications- 
could be designed. Other papers dedicated more to applications will follow this 
one anyway in which the dear reader can find more hints for optimizing the 
penalty for some specific applications. 

4. Linear identifiability assumption 

Whenever the estimation of the vector /? from (1.1) is the statistician's main 
concern (i.e., the inverse problem framework), and the rank of matrix X is 
inferior than p, one is faced unavoidably to an identifiability problem of model 
(1.1) that needs to be addressed before using the proposed model selection 
approach. To put forward such an issue, let us consider the following noiseless 
linear model 

y' - (4.1) 

and consider the two subspaces of R^: Si = {X^Xt,t E tU'}, and 52 = {(/p — 
X^X)t, t e RP} . Please note that 5i U ^2 = R^, and 5i n ^2 = 0, one deduces 
that all solutions of the form 

/3(r/) = Xty' + (/p - X^X)r^ 

for any p— dimensional real vector ry, satisfy equation (4.1). This simply means 
that, unless the component of f3 that belongs to 52, i.e., (Ip — X^X)/3 is known 
or identifiable with respect to the identifiable component X^Xf3 , one cannot 
generally recover /3 from an instance of (4.1), hence its estimation from an 
observation of model (1.1) is highly problematic. 

Therefore, in order to guarantee identifiability of model (1.1), the hypothesis 
which we put forward in section 4 and which we called the linear identifiability 
condition consists in saying that one knows a priori a linear operator /C that 
we called a noiseless reconstructor of f3 such that one can write j3 = JCX(3 or 
equivalently {Ip — JCX)(3 = 0. Such a linear identifiability condition happens 
in fact to be realistic for various practical applications and some of them are 
discussed below. 

Firstly, in many engineering fields, one commonly assumes that the solution 
p that one would like to recover from an observation of (1.1) is compressible in 
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some basis A of (for instance some wavelet basis (22; 27; 17)); which means 
that if one computed the coefficients of the scalar product between (3 and the 
respective vectors of the basis A, many of these coefficients would be found to 
be zero (or almost zero). We show indeed in this case that under some mild 
assumptions which happen to be realistic for numerous practical applications, 
one may derive easily a possible expression of the noiseless reconstructor K. that 
allows to overcome the identifiability problem of the model. In short, saying that 
f3 is compressible in some basis A means that one is capable of extracting from 
A a subset of vectors which scalar product with /? is fatally zero ^. Hence one 
proceeds by extracting from A a number fc = p — rank(A') of such vectors, then 
use them to construct the rows of a matrix so as to enforce on the solution 
P an equality of the form 0/3 = 0. One checks easily that if matrix satisfies 
simultaneously the two following conditions: 



1. (j)(3 = 

r X 

2. the rank of the augmented matrix ^ 
rows of X and the rows of 0- is equal to p 



-defined by the union of the 



then holds necessarily the following linear relationship between the two vectors 
P and XP: 

p = (x'^x + 4)^4^ x^xp 

Hence, one can define the linear reconstructor /C of /3 as follows 



IC ^ IC{X, (/>)■- (X'^X + (1)'^ 4)) X 



More generally, one shows that for any matrix H such that the rank of the 

' -which rows are the union of the rows of X and the 



augmented matrix 



H 



rows of H- is equal to p then one has the following linear relationship which 
holds for the two vectors P and XP: 



AT^x + n'^n) ^ x'^xp+{^{ip^ x'^x)^^p 



In particular, if one chooses matrix H in such a way to have |jn/3|| <C ||/3||, 
thereby || (n(/p - X'^X)^ViP\\ < ||/3||, then one has 



p w ix^x + n^n) ^x^xp 



Such an idea might be useful when one a-priori knows that the solution P be- 
longs to some subspace of R'' which is given by \t £ W, Gt w t} for instance 
a subspace of the form {t e W, [G - Ip)t < C||tf } with C = o(l) and G 



®When for instance /3 is known to bo smooth, any trigonometric or wavelet basis might do 
the job. 
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standing for some linear operator which is a p by p matrix different enough 
than the p hy p identity matrix; in the sense that it can compensate for the 
rank deficiency of matrix X in the sense that we mentioned above. As an il- 
lustrative example, assume that the solution is smooth, then it is known that 
multiplying it on the left by a smoothing kernel Ga belonging to some family 
of parametric kernels G = {Ga,(J G S} (e.g. gaussian kernels with increasing 
bandwidth cr) would yield approximately the same solution provided evidently 
that the parameter a of is set in such a way not to flatten too much the 
solution. In this case, one would take Ga as the kernel in Q with the smallest 
possible smoothing power corresponding, say, to a parameter a* and such that 
~ X 1 

is of full rank (i.e., p). One could then meet 



the augmented matrix 



Ip ^ Ga 



partially the identifiability condition of model (1.1) with matrix cj) being equal 
to Ip - Ga* ■ 

We mentioned in section 3 that one can recover, by using the proposed model 
selection approach, the least-norm solution p, = X^y' = X^XP of the equation 
y' = X(3, in other words the solution of the optimization problem 

ll/iiP min 

Xtj.=XI3 

since one has y = Xfi + Rz and the linear identifiability condition which is met 
for jjL since one has // = X'^ X(x. One can actually generalize such an idea in 
order to estimate all identifiable solutions of the form 

fi^Il^ min 

Xn = xp 
4>n = o 

for some p hy p positive semi-definite symmetric matrix 11 and some k hy p 
matrix 4> {k < p — rank(X)). Then, one checks easily that, if matrix (ll + 
X"'" X + 4>^ 4>) is of full rank, such a solution /i is unique (i.e., identifiable) and 
it is given by the formula 

= K.XI3 = KXn 

where 

K. = B{X'^ - X'^XAX'^ ~ 4F(t)AX'^) + AX'^ (4.2) 



with 
and 



(n + x'^x 



B = X' X 



In practice, one may approximate formula (4.2) for a great enough positive 
number ^ as follows 
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Such a notion might be useful when /3 belongs for instance to some ^2-body 
(ellipsoid), in other words, one knows some orthonormal matrix $ and a positive 
semi-definite matrix C such that 

($/3)^C(«>/3) < 1 

Hence, one might define for example matrix 11 as follows 

n := $'^C$ 

and matrix ^ might be useful if one knows that some of the components of the 
vector of the coefficients $/3 are fatally zero, in this case, one constructs as 
follows 

with Sk — I($^)j.=o- ^ common example is when the coefficients of P with re- 
spect to some orthonormal basis $ decay rapidly typically like a power law, 
which is the case of most useful practical solutions when the orthonormal basis 
$ is appropriately chosen. 

We would like to add that the author is currently investigating other identi- 
fiability schemes that might broaden the scope of application of the proposed 
method. 

5. A numerical study 

Please, note that our experiments below were realized with the penalty (3.6), 
and its constants were chosen as explained in subsection 3.3. We will show 
the performance of our method on two applications by using the classical per- 

formance ratio p = — r' . ^ r, where £[11/3 — B\\^] is estimated by 

averaging the value of \\(3 — /3|p over 50 instances of the noisy signal y, and 
mmm^M ^[\\0m — is computed directly by using formula (3.2). 
The first application concerns gaussian filtering for smooth signals, and the sec- 
ond application concerns statistical inversion of ill-posed linear inverse problems 
by using regularity (smoothness) and parsimony priors on the solution. 

All our experiments below were performed on the following signal (see fig. 1) 
and for p ~ 100: 

Pit) = ^(^cxp[-t](t/10)V2-f(i/10)log(VlO+l)-f25sin(V5)exp[V50]);Vt = I,--- ,p 
and their Matlab code is available upon mail request to the author. 

5.1. Gaussian smoothing 

We assume an homoscedastic linear regression model 
y{t) = /3{t) + z{t); t = lr-- ,P 
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Table 1 

Some results showing the performance ratio of the proposed method versus the oracle 
performance for the problem of signal gaussian smoothing; for different sizes (M) of the 

model collection. 



Gaussian smoothing by model selection 

M p 

50 1.7321 

100 1.6943 

200 1.6135 

500 1.5640 

1000 1.5875 



where z(t), t = 1, ■ ■ ■ ,p are i.i.d. standard gaussian variables, and we consider 
the collection of AI linear filters {Ga„^,ni ~ I,-- - ,Af} where for all m = 
,M, Gcr„ stands for a gaussian filter {p by p symmetric matrix) with 
bandwidth am defined as follows 



■ exp 



2cr2 



; Vi,j = I,-- - ,P 



where for all m = 1, • • • , M and for all i — 1, ■ ■ ■ ,p, 
tion constant with respect to row i of Go-^ given by 



stands for a normaliza- 



exp 



2ai 



Typically, we take the (Tm's as multiples of the value cr = ^ as follows a„i = 
m ■ (j,m — 1, • • • ,M. The goal is then to select the gaussian filter with the 
best bandwidth to apply on the signal y by using the described model selection 
approach, and the results are summarized in table 1 above. 



5.2. Statistical inversion of ill-posed linear inverse problems 

We assume the following linear inverse problem 

y = XP + z; t=l,--- ,p 

where z = z{t)t=i^... stands for a standard p— dimensional gaussian vector, 
and X is an ill-conditioned p hy p matrix. For our simulations, we generated 
such a matrix X randomly such that for all i, j = 1, • ■ • ,p, one has X{i,j) is 
an i.i.d. standard gaussian variable. To check the ill-conditioning of a randomly 
generated matrix X, we computed the ratio between its largest and smallest 
singular value |-. For the matrix X we used, we found that s* ~ 19.9659 and 

s* = 0.0098, hence |- ~ 2043.7. Now, to recover /3 from a a noisy observation 
y, we used a collection of AI linear filters of the form (see section 2) : 

^m:= (^X^X + n,ny'x^ (5.1) 
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where, for all m = 1, • • • , Af , one has 7i,„ which stands for some linear combi- 
nation of first, second and third order discrete differential operators as follows 

Tim - arniV^V' + br„{Vfv'' + C^{V^fv^ 

where Om, hm and Cm stand for three positive numbers, and I?^, T)^ and 
stand respectively for first, second and third order differential operators given 

by 

[ 0, else. 

and 

so as to achieve different degrees of smoothness in the recovered solution. 
The sequence of the triplets {{am,bm, Cm),m e A^} was generated as follows. 
So, for all m = m{i,j, k) for some i, j, fc S [0, • • • ,9] 

a™ := (2* - 1) ; 6™ {V - 1) ; c„ (2'= - 1) 

so we ended up with a collection of 1000 linear filters of the form (5.1). We 
repeated this experience for 50 instances of the noisy signal y in order for us to 

be able to estimate the performance ratfo p ^ '"["^7^J'^] , between the 

min„eAiE[ll/3r„-/3|Pj 

penalized risk and the oracle risk; so we could estimate p = 4.8936. Clearly, such 
a value of p is about three times greater than in the experiments we showed in 
subsection 5.1 which, at first glance, may look like an underperformance of the 
presented approach. This is not actually the case and this needs to be moderated 
for the reasons that we review here. The first reason is that the problem we 
treated in this subsection is much more complicated then the previous one since 
matrix X is severely ill-conditioned which results in larger penalties, thereby in 
larger upper bounds of the penalized risk. The second reason is that we used 
so strong priors that an oracle was enabled to recover the original signal almost 
perfectly. To see this, we computed for the present experiment the relative error 

ratio ^11 ] g^jj^ found a value of order of 0.2%; which means 

that the oracle succeeded in recovering almost perfectly the actual signal /3. We 
compared such oracle relative error with the one of our method, and we noticed 
that the latter could recover the solution with a relative error of order less than 
1% which is not that bad (see figure 2 for visual assessment of the recovered 
solution). Please, note that if one wanted to recover the solution in the present 
experiment by using direct inversion of the linear model as follows X~^y, then 

one's expected value the relative error ratio, i.e., \\I3\\'^ would be of the 
order of 610% which is of course unreasonable from a practical point of view 
(see also figure 2 for a visual constatation). 
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5 - 




10 20 30 40 50 60 70 80 90 100 



Fig 1. An instance of gaussian smoothing of a noisy signal by using the described model 
selection method : the number of models is M = 1000; the oracle selected for this signal 
a* = 2.43 and the method selected u = 4.05. In dashed blue; The original signal (fS) ; In 
black.' The noisy signal (y) ; In green .■ The oracle- driven smoothed signal ($* ) ; In red; 
The data-driven smoothed signal (P). 



4 




1 1 , , , 1 1 , , 1 1 1 

10 20 30 40 50 60 70 80 90 100 

t 



Fig 2. An instance of a statistical inversion of a linear inverse problem with regularity prior on 
the solution by using the proposed model selection approach : In dashed blue; The original 
signal (f3) ; In black; The noisy signal (^) scaled by a factor o/-| to allow better visualization 
; In green ; The oracle-driven estimated solution ($* ) ; In red; The data-driven estimated 
solution (P). 
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6. Conclusion 

We shall conclude the present paper briefly by saying that we presented a new 
model selection framework that addresses the problem of non least-squares lin- 
ear estimation in linear regression and linear inverse problems by using model 
selection via penalization in the spirit of the pioneering works on non-asymptotic 
model selection works of Birgc and Massart mainly (12), and we showed its good 
performance on two practical problems renowned to be difficult to address in 
practical applications. Moreover, we think that with more optimized penalty 
constants (optimized for instance by the means of a simulation study), one 
could achieve even better performance of the proposed approach. We would like 
then to note that other papers dedicated more to applications of the proposed 
approach will follow the present one in which we plan to study more optimized 
penalties for various applications pertaining to our field of expertise (mainly 
signal and image processing applications). 

The present paper constitutes a first attempt by the author to give a satis- 
factory answer to the question of non-least squares estimation in regression and 
inverse problems by model selection. There are, of course, still many challenges 
that we would like to address along with the statistical community in the near 
future before achieving a complete final package of the present approach, among 
which we review some below in the form of quest ion/ answer: 

• Could our results in this paper be improved ? Our answer to this question 
would be by "y^s". Indeed, though we personally think that the concen- 
tration inequality that we made use of to prove the main theorem in this 
paper is sharp (see lemma 7.1), nevertheless, if one could propose sharper 
concentration inequalities that could lead to smaller penalties, one might 
improve on our model selection results in this paper. 

• What happens if one used instead smaller penalties than the one we pro- 
posed in this paper? In fact, one notices that theorem 1 does not forbid 
the use of smaller penalties than penalty (3.6) which we showed to enforce 
an oracle inequality. However, since theorem 1 only gives an upper bound 
of the penalized risk, hence it is difficult to figure out actually what would 
be the behavior of the penalized estimator if one used smaller penalties 
mainly for models with a value of Qm which is too big. This is an open 
question that we will endeavor to solve in the near future hopefully in the 
spirit of the findings of Birge & Massart in (12). 

• When a linear model is underdetermined, is the linear identifiability as- 
sumption really mandatory in order for the model selection procedure to 
work? A rapid answer to this question would be by "y^s", because such an 
identifiability assumption serves in some sense as a compass for the model 
selection procedure to go in the right direction in an attempt to recover 
the sought solution of interest (among an infinite number of candidate 
solutions), which turns to be rather subjective because it is imposed by 
the user. Nevertheless, we do not exclude that when some generic assump- 
tions can be made on the solution of interest (by assuming for instance 
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that it belongs to some restrictive class of p— dimensional vectors), one 
might devise for example some (implicit) identifiability schemes that can 
achieve comparable results to those in this paper. 

• When noise matrix R is unknown, how one should modify the penalty 
function to allow simultaneous estimation of f3 and i?? We believe that 
this could be done in the near future for example in the spirit of the 
recent work of Baraud and collaborators in (5). 

• What is the convergence rate of the proposed estimation approach with 
respect to some classes of the vector f3 (e.g. a Sobolev body)? We did not 
answer yet this question in the present paper, however, we are currently 
investing a significant amount of our time trying to answer this question 
and many other theoretical questions which could be of significant interest 
either from a theoretical or from a practical point of view. 

By saying this, we concluded then this paper. 
7. Appendix section 

Appendix 1 : Proof of the main theorem 

Proof. We shall use sometimes in the proof the fact that 2ab < 770^+— , Va, b,ri > 
0. 

Let us fix some m E A4. One has one the one hand 

and by using the fact that (3 — ICXP, one finds that 

11/3™ ~pf = !|/3|p + (X/3)^(^'^*,„ - 2/C^*,„)x/3 + z'^R^'^l^mRz 

+ 2(XPf{'fm-ICf'frnRz 

On the other hand, one has 

Crit(m) = 2/^(*^*™ - 2/C^*™)y + pen(m) 

hence 

Crit(m) = (X/3)^(*^,*,„ - 2/C^*™)x/3 + z'^R^(^^l^„^ - 2IC^^m)Rz 
+ 2(X/3)^(*^*„ - /C^*„ - ^-^/C) Rz + pen(m) 

One derives that 

||/3m-/3|p-Crit(m) = |l/3f + 2z^i?'^/C'^*,„i?z + 2(X/3)^*^/Ci?z-pen(m) 
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Now, let US fix some 9 e (0, 1), one finds tliat 

+ 2(X/3)^(«'^/C - e^l^'i>,n + eiC'^^,n)Rz + \\f3f - pen(m) 

hence 

(l-0)||/3„-/3||2-Crit(m) = -0/3^ (*„,X-/p)^ (*,„X-/p)/3+2^i?^ (2/C^«'„,-0*^*,„) i?z 
+ 2/3^(x^1'^,/C - 0X'^^l^^„, + e^,n)Rz + ||/3f - pen(m) 

and by adding and subtracting tlie quantity 20^JCRz in the left side of the latter 
equality, one finds that 

(1-0)||A„-/3||2-Crit(m) = -0/3^(*„X-/p)^(*„X-/p)/3+z^i?^(2/C^«'„-0*^*™)i?z 
- 2(3^ {^rnX - [e-^rn - Rz + + 2(3'^ICRz - pen(m) 
If now one puts = (j^mX — Ip)P for all m e A^, one derives 

(1 - e)\\(3„. - - Crit(m) = -B\\r^,nf + z^i?^ (2/C^*,„ - 0<„*,„)i?0 
- 2ri!^{6^,n ^ICjRz+ + 2(3^ICRz - pen(m) 

and by definition of P — Pm, one finds that 

(1 - e)\\p - i3f = -e\\rjrn\\^ + - e^l^m + 2/c^*™)i?z 

- 2r]'^ (^O'^rh -JCjRz- pen(?n) 
+ inf |||/3||2 + z/f*^^^'„ -2/C'^«'„0y + 2/3^/Ci?z + pen(m)| 

Since m is random and can be any m g Al, it follows that in order to control 
11^ — /3|p, one needs to control uniformly, i.e., for all m G TVJ simultaneously, 
the expression 

r„, - z^R^ ( - + 2/C^*„) Rz ~ 2rii (^^^ - /c) Rz 

To this end, we shall make use of concentration inequality (7.1) of lemma (7.1). 
So, let us put 

An =R^{- + -'C^*™ + 

B,„ = (e-^rr. - /C) RR^ (e-^ra - /C) ^ 
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and denote by the largest positive eigen value of the symmetric matrix Am, 
and the largest singular value of matrix Bm- Then, by using concentration 
inequality (7.1) of lemma (7.1), one has for al\m G M, with a probability larger 
than 1 — exp[— Xm], with Xm > for all m G M, that 



2^2 J iZBmV 



< tr{Am)+2\\A. 

< tr{Ara)+2\\A 

< tr{Am 



In 



1 



2 1 1 A„i 1 1 V Xm + 7ni'"mll'7ni|P + 2( h S^)x„ 

7" 



for every positive number jm- If one takes r^jm = ^, hence A- — Im^ one 
then finds that, simultaneously for all m E M, with a probability larger than 
1 - J2meM expha^m] that 



(1 - 0)11^ - /3f < tr{Am) + 2\\Am\\V^^ + 2C-f- + sr)xm - pen(m) 

a 

y^.'^Hm - 2/C'^«'„0zj + 2P^ICRz + pen(m)} 



meM I. 



and by putting for an arbitrary positive number ^: Xm ~ Lm + S, for all to € Al, 
one derives that, simultaneously for all to e A^, with a probability larger than 
1 - Sexp[-^] that 



(l-e)0-Pf<tr{Am)+2{f+s±)Am+2\\Am\\^/A~TI+2{^^ 



- inf {ll/5|| 



- 2/C^*™)y + 2/3^/Ci?z + pen(m)} 



We need now to separate ^ from one particular m by deriving for all m G Al a 
sharp upper bound for the expression 



2\\Am\\VL^+2{^ + .s+)^ 



To do this, we consider for m G a positive number /i,„, then one has 



VLm + e - \/i 

A > 



= . Hence, for all 



2\\Am\\VLm+^ + 2{^+S+)^ < 2\\Am\\VLm+hr^ 

WAmW 



ri 



< 2\\Am\\VLm + h„ 
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Now, let US put for all rn e A4 

WA^W 



Q„, = tr{A„^)+2{^+S+)L„^+2\\Arn\\VL,n + K, + \ 

hence with a probability larger than 1 — I]cxp[— ^] that 

e 
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r* , N 2 



771 I '^771 



{1 - 9)\\/3 - < Qra - pcnim) + 



A 



+ inf |||/3||2 + z/(*^^',„-2/C^«'„)y + 2/3^/Ci?z + pen(m)| 

and after integration over all values of ^, one finds that 

2E 



r 1 r "1 z2J 

{l-e)E \\f3^(3\f < sup i Q„ - pen(m) I + — 

L -I mGM ^ J A 

+ inf \\\|3\\^+p'^x'^^l^^^>„^xp-2p^^^>„^xp+\\^^>„^R\\^-2tr(R^lC'^^„^R^ 

meM I \ / J 



ie 
since 



E 



inf {WPf + y^f*^^'™ - 2/C^*,„)y + 2p'^K.Rz + pen(m)| 

^eM L V J 

C inf (E[||/3||2+y^(*f„*„-2/C^*„0y + 2/3^/Ci?z + pen(m) 

meM y I \ 

inf |||/3||2+/3^X^^'f„*,„X/3-2/3^*„X/3+!|*„i?f-2trfi?^/C^*„,i?)+pcn(m)| 



□ 



Appendix 2 : A useful concentration inequality and its proof 

Lemma 7.1. Consider the random process: T = Az + iP" z, where A is p 
by p real square matrix, b is a p— dimensional real vector, and z = {zk)k=i,p 
is a p— dimensional standard gaussian vector, i.e., Zk,k — l,p are i.i.d. zero- 
mean gaussian variables with standard deviation 1. let us denote by Sk,k = l,p 
the respective eigen values of the symmetric matrix ■^{A + A'^^ , and let us put 
s+ = sup{sup^^]^ ... _p{sfc}, 0}, and s~ = sup{sup^.^]^ ... p{— s^}, 0}. Then, the 
following two concentration inequalities hold true for all x > 



T > tr{A) + 2^^\\A + AT\\^ + ^\\b\\'^^ + 2s+x < cxp[-x] 



(7.1) 



T < tr{A) - 2\/i||A + ATp + ^\\bW^yG-2s-x 



< exp[-a;] (7.2) 
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Proof. We already proved this lemma in (9), so we redo such a proof here. We 
shall make use of the result of lemma (7.2) below to prove lemma (7.1). To do 
this, let us consider the random process T = z'^ Az + b'^ z, that one can rewrite as 
follows: T = lz'^{A + A^)z + }Fz, then by using the eigen value decomposition 
of the symmetric matrix + A-^), one derives T = X]fc=i ■^fc^'fe + where 
Sfe, k = l,p stand for the respective eigen values of \{A-\- A^), z' = z with 
U standing for the (orthonormal) eigen matrix of + , and b' = U^b. By 
noticing that z' stands for a p— dimensional standard gaussian vector, \\b'\\^ = 

so by applying lemma 
□ 



fc=i 



= tr{A), and ELi 4 = + 
(7.2), the proof of lemma (7.1) follows immediately. 

Lemma 7.2. Let a — {ak)k=i,p and b — (bk)k=i,p be two p~ dimensional real 
vectors, and consider the following random expression : T — X]fc=i "^fc 
where Zk,k = 1, ■ ■ ■ ,p are i.i.d. N{0, 1), and let us put : a+ = supjsupj. 
a~ = supjsup^^]^ ... p{—ak},0}- Then the following two concentration inequali 
ties hold true for all real positive x : 



t + hzk, 
i,...,p{afe},0}, 



T>J2ak 

k=l 



fc=i 



ix + 2a+a; 



< exp[ 



(7.3) 



fe=i 



k=\ 



2a X 



< exp[— x] 



(7.4) 



Proof. This lemma was also proven in (9), so we redo the proof here. We shall 
make use of lemma 7.3 to prove lemma (7.2). Now. to prove lemma (7.2), first, 
one can notice that concentration inequality (7.4) can be obtained from (7.3) 
by considering the random quantity 



p 

r = -T = J2i-cik)zl + i~ 

k=l 



-bk)zk 



and by applying (7.3) on T' instead of T. So, we need to prove only (7.3). To 
do this, let us rewrite T as follows: T = J2k=i "^k, where Tk = OkZ^ + bkZk, and 
let us compute log [E(exp(?;(r — T)))] , where T ~ J2k=i '^k. We have 



E[cxp(yTfe)] 



^/2^7- 



exp 



-Ul ~ 2aky)t^ - 2ybkt) 



dt 



E[exp[yTk]] = exp 



1 - 2aky 



/27r 



exp 



i(v/l-2afcyt- 



bky 



dt 



E[exp[2;rfe]] = 



exp 


r 'ly-' 1 

2 y 






- 2aky 
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E^cxp[y(Tfc-afc)] 
log (E[exp[y{Tk - ak)] 



exp 


r (.2 n 

k 2 


exp[-yak] 


l-2aky 




VI- 


2aky 



1 - 2aky 2 



- ^ log ( 1 - 2afc?/) - aky 



Then, by putting a+ = sup { supj,^]^ ... p{ak}, O}, one derives (see the technical 
details below) that for all < y < 



log (E[exp[y{Tk - ak)] 



-,2 I £fc^„2 



< 



^fe ^ T 



2/ 



1 - 2a+y 

which implies by independence that for all < y < 



log E 



exp 



log (E[exp[y(r-r)] 



< 



1 - 2a+y 



Finally, by applying Lemma (??) below with u = \/X]fc=i('^fe + ^) ; ^ 
2a+ , one derives that for all a; > : 



T> [J] a,] +2, 



fc=i 



\ X! (ofe + y) Va; + 2a+x < exp[-a;] 
\ fc=i 



This terminates the proof of lemma (7.2) 

Some additional technical details about the proof of lemma (7.2) 

We will show in here that for all r > 0, a > r, and < y < one has 

— 1 r^y^ 

log(l — 2ry) — ry < 

2 - l-2ay 

and that for all r < 0, for all a > 0, and for all < ?/ < one has 

1 r^y^ 
logfl — 2ry) — ry < 



□ 



(7.5) 



(7.6) 
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Proof, let us start by showing inequality (7.5). To do this, let us consider the 
following function 

1 r^y^ 
frAy) -77 log(l - 2ry) - ry 



2 ' ^/ ^ \-2ay 

One first notices that /r,a(0) = 0, then a sufficient condition for inequality (7.5) 
to hold true is that /r,a(y)' < 0, for all < y < We have 

1 2 2 

f(y) = ^ log(f _ 2ry) -ry+'-^ + - -J2^ 

^ 2 ^' ^ 2a (2a)2 1 - 2a2/ 

one then derives that 

„2 5"^ 

/r,a(y)' 



f - 2ry 2a (f - 2ay)2 

. . y ^ 2r27/ r^y r^y 

" f -2r2/ (l-2ay) (f - 2ay)2 

Ir^aKV) S i„2r2/ {l~2ay) 
and finally since < -jjz^^^^ one deduces that 

^•^^"^^^ - (l-2ay) (l-2ay)-" 
then we have shown (7.5). 

We proceed in the same way as for showing inequality (7.5) to show inequality 
(7.6). So let us consider the following function 

— 1 r^y^ 



9r,a{y) -TT log(l ^ 2ry) - ry - 



2 ' f-2ay 

One first notices that gr.a(O) = 0, then a sufficient condition for inequality (7.6) 
to hold true is to that gr.aiy)' < for all < y < ^5^. One derives that 

, r r 2^ 

9rAy) = " + 2^ " (f - 2ay)2 

2j,2y j„2^ j,2y 



5r,a(y)' = 



1 - 2ry (1 - 2a?;) (1 - 2ay)^ 
2r'^y 2r^y 



9rAy)' ^ i^2ry (1 - 2a2;) 
and finally, since < -jjzyz^^ '^^^ finds that 

2r'^y 2r^y 



9rAy)' < 



(1 - 2ay) (1 - 2ay) 



□ 
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Birge's & Massart concentration inequality 



Lemma 7.3. If a random variable ^ satisfies for some two real positive numbers 
u and V the following inequality : 



logfE exp[y^] 



< Ml^Jorall 0<y< i 
I — vy V 



then 



vx 



< exp[— .T],/or all a; > 



(7.7) 
(7.8) 



The proof of this lemma can be found in (10). 
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